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Abstract. Let A he a, n by n matrix. A skeleton decomposition is any factorization of the form CUR where C 
comprises columns of A, and R comprises rows of A. In this paper, we consider uniformly sampling 
£ ~ fclogn rows and columns to produce a skeleton decomposition. The algorithm runs in 0{£'') 
time, and has the following error guarantee. Let ||-|| denote the 2-norm. Suppose A ~ XBY^ where 
X,Y each have k orthonormal columns. Assuming that X,Y are incoherent, we show that with 
high probability, the approximation error ||^ — Cf/i?|| will scale with {n/£) \\A — XBY'^\\ or better. 
A key step in this algorithm involves regularization. This step is crucial for a nonsymmetric A as 
empirical results suggest. Finally, we use our proof framework to analyze two existing algorithms in 
an intuitive way. 
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1. Introduction. 

1.1. Skeleton decompositions. This paper is concerned with the decomposition known 
as the matrix skeleton^, pseudo-skeleton [22], or CUR factorization [29, 18]. 

For the rest of the paper, we adopt the following Matlab-friendly notation: given A G 
C"*^", A-^c will denote the restriction of A to columns indexed by C, and Aji: will denote the 
restriction of A to rows indexed by R. A skeleton decomposition of A is any factorization of 
the form 

A.cZAr, for some Z G C''''''. 

In general, a rank-fc approximation of A takes up 0{{m + n)k) space. The major advan- 
tage of the skeleton decomposition is that it may require only 0{k'^) space. Consider the case 
where A^s entries can be specified by a function in closed form. Then the skeleton decom- 
position is fully specified by Z G C'^^'^, as well as the two index sets C and R. In addition, 
row and columns from the original matrix may carry more physical significance than linear 
combinations thereof. 

There are important examples where the full matrix itself is not low rank but can be parti- 
tioned into blocks each of which has low numerical rank. One example is the Green's function 
of elliptic operators with mild regularity conditions [4]. Another example is the amplitude 
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^The term "skeleton" may refer to other factorizations. 

Instead of A ~ Ac.ZA.u, we can have A ~ Z\ArcZi where Z\,Z2 are arbitrary mx k and k x n 
matrices [14]. As 0{mk + nk) space is needed to store Zi, Z2, this representation does not seem as appealing. 
Nevertheless, it is numerically more stable and has found several applications [25]. 

When A — MBN where M, B, N are nx n matrices, we can approximate M as M.cP, N as DNr; , where 
Mc has k columns of M and A^^ has k rows of A''. Thus, A ~ Mc{PBD)Nr, effectively replacing B with the 
k X k matrix B := PBD. Bremer calls B a skeleton and uses it to approximate scattering matrices [7]. 
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Algorithm 1. Skeleton via uniform sampling, 0{k^) algorithm. 
Input: A matrix A G C"*^", and user-defined parameters i and 6. 
Output: A skeleton representation with £ rows and i columns. 
Steps: 

1. Uniformly and independently select i columns to form A-c- 

2. Uniformly and independently select £ rows to form Aft:- 

3. Compute the SVD of Ajic- Remove all the components with singular values 
less than 6. Treat this as a perturbation E A such that ||£'rc'|| < 6 and 
\\{Arc + Erc)+\\<S~\ 

4. Compute Z = {Arc +Erc)+. (In Matlab, Z=pinv(A(R,C) , delta).) 

5. Output A-cZAr;. 



factor in certain Fourier integral operators and wave propagators [11, 16]. Algorithms that 
compute good skeleton representations can be used to compress such matrices. 

1.2. Overview. In this paper, we consider uniformly sampling i ~ fclogmax(m,?i) rows 
and columns of A to build a skeleton decomposition. After obtaining A-c^^R: this way, we 
compute the middle matrix Z as the pseudoinverse of Arc with thresholding or regularization. 
See Algorithm 1 for more details. Suppose A ~ XiAuY^ where Xi,Yi have k orthonormal 
columns. Assuming that Xi,Yi are incoherent, we show in Theorem 1.2 that the 2-norm 
approximation error ||^ — A-cZAr-\\ is bounded by — Xi74iiy|^|| (mn)^/^/^) with high 

probability. We believe that this is the first known relative error bound in the 2-norm for a 
nonsymmetric A. 

The idea of uniformly sampling rows and columns to build a skeleton is not new. In 
particular, for the case where A is symmetric, this technique may be known as the Nystrom 
method^. The skeleton used is A-c^cc^^!-^ which is symmetric, and its 2-norm error is 
recently analyzed by Talwalkar [37] and Gittens [21]. Both papers make the same assumption 
that Xi,Yi are incoherent. In fact, Gittens arrives at the same relative error bound as us. 

Nonetheless, our results are more general. They apply to nonsymmetric matrices that are 
low rank in a broader sense. Specifically, when we write A ~ XiAnY^ , An is not necessarily 
diagonal and Xi, Yi are not necessarily the singular vectors of A. This relaxes the incoherence 
requirement on Xi,Yi. Furthermore, in the physical sciences, it is not uncommon to work 
with linear operators that are known a priori to be almost diagonalized by the Fourier basis 
or related bases in harmonic analysis. These bases are often incoherent. One example is an 
integral operator with a smooth kernel. See Section 4 for more details. 

The drawback of our results is that it requires us to set an appropriate regularization 
parameter in advance. Unfortunately, there is no known way of estimating it fast, and this 
regularization step cannot be skipped. In Section 4.1, we illustrate with a numerical example 
that without the regularization, the approximation error can blow up in a way predicted by 
Theorem 1.2. 



^In machine learning, the Nystrom method can be used to approximate kernel matrices of support vector 
machines, or the Laplacian of affinity graphs, for instance. 
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Finally, we also establish error bounds for two other algorithms. We shall postpone the 
details. 

1.3. Some previous work. Before presenting our main results, we like to provide a sample 
of some previous work. The well-informed reader may want to move on. 

The precursor to the skeleton decomposition is the interpolative decomposition [14], also 
called the column subset selection problem [20]. An interpolative decomposition of A is the 
factorization A-qD for some D. The earliest work on this topic is probably the pivoted QR 
algorithm by Businger, Golub [8] in 1965. In 1987, Chan [12] introduced the Rank Revealing 
QR (RRQR); a sequence of improved algorithms and bounds followed [19, 26, 31, 23]. RRQR 
algorithms can also be used to compute skeletons [28]. 

An early result due to Goreinov et al. [22] says that for any A € ([^mxn^ there exists a 
skeleton A-c^Ar- such that in the 2-norm, || A — || = 0{\/k{^/m + ^/n)ak+l{A)). 

Although the proof is constructive, it requires computing the SVD of A, which is much more 
costly than the algorithms considered in this paper. An useful idea here is to maximize the 
volume or determinant of submatrices. This idea is also used to study RRQRs [13], and may 
date back to the proof of Auerbach's theorem [35], interpolating projections [33] etc. 

One way of building a skeleton is to iteratively select good rows and columns based on 
the residual matrix. This is known as Cross Approximation. As processing the entire residual 
matrix is not practical, there are faster algorithms that operate on only a small part of the 
residual, e.g.. Adaptive Cross Approximation [2, 3], Incomplete Cross Approximation [39]. 
These methods are used to compress off-diagonal blocks of "asymptotically smooth" kernels. 
The results of this paper will also apply to such smooth kernels. See Section 4.2. 

Another way of building a skeleton is to do random sampling, which may date back to [20] . 
One way of sampling is called "subspace sampling" [29, 18] by Drineas et al. If we assume 
that the top k singular vectors are incoherent, then a result due to Rudelson, Vershynin [34] 
implies that uniform sampling of rows and columns, a special case of "subspace sampling" , 
will produce a good skeleton representation A-c{A'^^AA'^.)Ar-. However, it is not clear how 
the middle matrix A'^'^jAA'^. can be computed in sublinear time. In this paper, the skeleton 
we analyze resembles A-cA'^fjAn-, which can be computed in 0{k^) time. 

1.4. Preliminaries. The matrices we consider in this paper take the form 

where X = (Xi X2) and Y = (Yi Y2) are unitary matrices, with columns being ^^spread" , 
and the blocks A12, A21 and A22 are in some sense small. By "spread", we mean 0(l)-coherent. 

Definition 1.1. Let X G C"-^*^ be a matrix with k orthonormal columns. Denote 11^^!^^^^ = 
maxjj \Xij\. We say X is ^-coherent if \\X\\^^^ < (^/n)^/^. 

This notion is well-known in compressed sensing [10] and matrix completion [9, 30]. 

To formalize "small", let := ( ? ), and consider that 



^ A21 A22 ^ 

£k '.= \\^k\\ is small. (1-2) 
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By tolerating an error in the 2-norm, A can be represented using only 0{k^) data. Note 
that Sk is equivalent to max up to constants. To prevent clutter, we have 

suppressed the dependence on k from the definitions of Xi,Yi, An, A12 etc. 

If (1.1) is the SVD of A, then = ak+i{A). It is good to keep this example in mind as 
it simplifies many formulas that we see later. 

An alternative to is 

m n 

i=i j=i 

In other words, is the norm of reshaped into a vector. We know efc < < mnek- 
The reason for introducing is that it is common for (Afc)jj to decay rapidly such that 
^ mnSk- For such scenarios, the 2-norm error may grow much slower with m, n. 

1.5. Main result. Our main contribution is the error analysis of Algorithm 1. The prob- 
lem with the algorithm is that it does not always work. For example, if j4 = XiAnY^ and 
Xi = {^''q''), then Aji- is going to be zero most of the time, and so is A-c^^R:- Hence, it 
makes sense that we want Xi^n^ to be "as nonsingular as possible" so that little information is 
lost. In particular, it is well-known that if Xi,Yi are 0(l)-coherent, i.e., spread, then sampling 
£ = 0{k) rows will lead to ^i,_R:,yi,C: being well-conditioned^. 

Here is our main result. It is proved in Section 2. 

Theorem 1.2. Let A he given by (1-1) for some k > 0. Assume m > n and Xi,Yi are 
fi-coherent. Recall the definitions of ek,£'f. in (1-2) and (1-3). Let I > 10/uA;logm and A = 

^"^"•^ ^ . Then with probability at least 1 — 4/cm~^, Algorithm 1 returns a skeleton that satisfies 

\\A-A,cZAr.,\\ =0{\5 + Xek + £l\/5). (1.4) 
// the entire X and Y are ^-coherent, then with probability at least 1 — Am~^ , 

\\A- A.,cZAR.,\\=0{\5 + e'k + e'k/{X5)). (1.5) 

Minimize the RHS of (1.4) and (1.5) with respect to 5. For (1.4), pick 6 = Q{£k) so that 

\\A - A.,cZAr.,\\ = OiskX) = 0{ek{mn)^'^/t). (1.6) 

For (1.5), pick 5 = e(e'^/A) so that 

\\A-A.,cZAR.,\\=0{e'k). (1.7) 

Here are some possible scenarios where e'^, = o{£kX) and (1.7) is much better than (1.6): 
• The entries of decay exponentially or there are only 0(1) nonzero entries as m, n 
increases. Then e'^ = Q{ek)- 

■^The requirement that ~ 0{n^^^'^) can be relaxed in at least two ways. First, all we need is that 

maxi(5]]^ 1^)^''^ ^ {l^k/^Y^"^ which would allow a few entries of every row of Yi to be big. Second, if 

only a few rows of Y violate this condition, we are still fine as explained in [1]. 
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• Say n = m and (1.1) is the SVD of A. Suppose the singular values decay as m 
Then 4 = Oietm^^^). 

One important question remains: what is e^? Unfortunately, we are not aware of any 
0{k^) algorithm that can accurately estimate e^. Here is one possible heuristic for choosing 
S for the case where (1.1) is the SVD. Imagine Arc — ^^^^ 
singular values of ^i,/?:, 5^i,C: ^■I'e likely to be on the order of (£/m)^/^, (£/n)^/^. Therefore, 
we may pretend that = crk+i{A) ~ X(Jk+i{ARc). 

Another approach is to begin with a big S, run the 0{k'^) algorithm, check ||^ — A-c^Ar- || , 
divide 5 by two and repeat the whole process until the error does not improve. However, cal- 
culating — A-cZAr-W is expensive and we will need other tricks. This is an open problem. 

The 0(/c^) algorithm is probably the fastest algorithm for computing skeleton representa- 
tions that one can expect to have. If we do more work, what can we gain? In Section 3, we 
sketch two such algorithms. The first one samples ^ ~ log m rows, columns, then reduce it 
to exactly k rows, columns using RRQR, with a 2-norm error of Oi^ekimkY^"^). It is similar 
to what is done in [5] . In the second algorithm, we uniformly sample £ ~ A; log m rows to get 
An-, then run RRQR on Ar- to select k columns of A. The overall error is 

0(efc(mn)i/2). 

This is similar to the algorithm proposed by Tygert, Rokhlin et al. [28, 40]. 

Using the proof framework in Section 2, we will derive error estimates for the above two 
algorithms. Our results are not new, but they work for a more general model (1.1) and the 
proofs are better motivated. We will also compare these three algorithms in Section 3.3. 

1.6. More on incoherence. Haar unitary matrices are 0(l)-coherent [17, Theorem VIII. 1], 
that is if we draw X, Y uniformly from the special orthogonal group, it is very likely that they 
are incoherent. This means Algorithm 1 will work well with £ = 0{k). 

If either X or K is not 0(l)-coherent, we can use an old trick in compressed sensing [1]: 
multiply them on the left by the unitary Fourier matrix with randomly rescaled columns. 
This has the effect of "blending up" the rows X,Y. The following is an easy consequence 
of Hoeffding's inequality, and the proof is omitted. 

Proposition 1.3. Let X G C"'^'^ with orthonormal columns. Let D = diag(di, . . . ,d„) where 
di,...,dn are independent random variables such that Kdi = and \di\ = 1. Let T he the 
unitary Fourier matrix and (jl = alogn for some a > 0. Define U := TDX. Then \\U\\^^^ < 
(^/n)^/^ with probability at least 1 — 2(n/c)n~^" 

In words, no matter what X \s, U = TDX would be 0(l)-coherent whp. Hence, we can 
write a simple wrapper around Algorithm 1. 

1. Let B := TD2ADiJ^ where Di,D2 are diagonal matrices with independent entries 
that are ±1 with equal probability. 

2. Feed B to the 0{k^) algorithm and obtain B ~ B-cZBr-. 

3. It follows that A ~ {ADiFI.)Z{Fb:D2A). 

The output is not a skeleton representation, but the amount of space needed is 0{n) + 0{k'^) 
which is still better than 0{nk). However, the algorithm may no longer run in sublinear time. 

2. Error estimates for 0{k'^) algorithm. 

2.1. Notation. Sc,Sr G (j^nxfc ^^^^ both column selector matrices. They are column 
subsets of permutation matrices. The subscripts "i? :" and ": C" denote a row subset and a 
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column subset respectively, e.g., Ar: = S]^A and A^c = ASc, while Arc is a row and column 
subset of A. Transposes and pseudoinverses are taken after the subscripts, e.g., A^. = (Ar:)'^ . 

2.2. Two principles. Our proofs are built on two principles. The first principle is well- 
known in compressed sensing, and dates back to Rudelson [34] in 1999. Intuitively, it says: 



Let Y he anx k matrix with orthonormal columns. Let Yc-. be a random row subset 
of Y. Suppose Y is /i-coherent with fi = 0(1), and \C\ = £ > fik. Then with high 
probability, (n/iy^'^Yc: is like an isometry. 



To be precise, we quote [38, Lemma 3.4]. (Note that their M is our ^k.) 

Theorem 2.1. Let Y G ([^nxk y^j^i]^ orthonormal columns. Suppose Y is ^-coherent and 
I > akfi for some a > 0. Let Yq-. be a random i-row subset ofY. Each row ofYc-. is sampled 
independently, uniformly. Then 



n 



(1 



< k 



{1-6) 



1-5 



for any 5 G [0, 1) 



and 



n 



(i + ^O^^''' 

If <5 = 0.57 and 6' = 0.709 and 1 > Wkfilogn, then 



for any 6' > 0. 



Y+\\ < 1.53(n/^)^/2 and ||yc:|| < 1.31(£/n)^/2'\ > i _ 2kn 



(2.1) 



We will use (2.1) later. Let us proceed to the second principle, which says 



Let C be a nonrandom index set. If ||A:c|| is small , then \\A\\ is also small, provided 
that we have control over ||^l2|| and Y^^-,. for some unitary matrix (Yi Y2). 



The motivation is as follows. If we ignore the regularization step, then what we want to 
show is that A ~ A-cA~iic^R-- ■ when we take row and column restrictions on both sides, 
we have trivially Ajic = ^iJC^/jc^iJC- Hence, we desire a mechanism to go backwards, that 
is to infer that := A — A-cA'^fjAji- is small" from ^^Erq is small." We begin by inferring 
that is small" from ''E.,c is small" . 

Lemma 2.2. Let A G C™^" and Y = {Yi I2) G C"^" be a unitary matrix such that Yi 
has k columns. Select £ > k rows ofYi to form Y\^C: = S'qY\ G C^^'^'. Assume Yi^c-. has full 
column rank. Then 



\A\\ < 



Y 



1,C: 



\A:C\\ + 



Y 



1,C: 



\AY2Y^C:\\ + \\AY2\\- 
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Proof. Note that Y^(j.Y^(t. = hxk- Now, 
\\A\\ < \\AYi\\ + \\AY2\\ 



AY^Ylc-Ylc 



C: 



+ \\AY2 



< \\AYiY^Sc\ 



Y 



1,C: 



< \\{A-AY2Y^)Sc 



< \\A. 



c\ 



Y 



1,C: 



+ \\AY2Y^'^c': 



+ \\AY2\\ 

+ \\AY2\ 



1,C: 
rT 



Y, 



l,C: 



+ \\AY2 



Lemma 2.2 can be extended in two obvious ways. One, we can deduce that is smah if 
Ar- is smah." Two, we can deduce that "A is small if Arc is small." This is what the next 
lemma says. The proof is a straightforward modification of the proof of Lemma 2.2 and shall 
be omitted. 

Lemma 2.3. Let A G C™^" and X = {Xi X2) G C'"''™ and Y = {Yi Y2) € C"^" he 
unitary matrices such that Xi , Yi each has k columns. Select i > k rows and columns indexed 
by R,C respectively. Assume Xi^ji-,Yi^C: have full column rank. Then 



and 



\A\\ < 



X 



1,R: 



\\Ar:\\ + 



X 



1,R: 



X2bXJA\\ + \\XiA\ 



\A\\ < 









\\Arc\\ + 




^l,R: 




Y+ 


\\X2,R:XjAYiY^C-\ 


+ 


^hR: 




Y+ 


\\Xi,R.XfAY2Y^C:\ 


+ 








\\^2,R:X2 AY2Y^C: | 


+ 


^tR: 


\\Xl,R: 


XfAY2\\ + 





\X^AY,Y,^C:\\ + 



mAY2\ 



We conclude this section with a useful corollary. It says that if PA-c is a good low 
rank approximation of A-c for some P G ([^mxm^ then PA may also be a good low rank 
approximation of A. 

Corollary 2.4. Let A G C™^" and P G C™^™. Let Y = (Yi Y2) be a unitary matrix such 
that Yi has k columns. Let Y\^c-_ = S'^Yi G C^^^ where £ > k. Assume Yi^c-. has full column 
rank. Let I G C™^™ be the identity. Then 



\A- PAW < 



Y 



1,C: 



1^. 



c 



PA..c\\ + 



Y 



1,C: 



/- P|| 11^^2^2/7:11 + 11-^- \\AY2 
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In particular, if P is the orthogonal projection A-cA^^, then 



\A - Ac^ic^ll ^ 



1,C: 



\AY2Y^C:\\ + \\AY2\ 



(2.2) 



Proof. To get the first inequality, apply Lemma 2.2 to A — PA. The second inequality is 
immediate from the first inequality since \\A-c ~ A-cA^^A-cW = 0. ■ 

For the special case where X, Y are singular vectors of A, (2.2) can be proved using the 
fact that 11^ — A:r7^!t^^ll = minjj 11^4 — ^^c'L'H and choosing an appropriate D. See [5]. 



A.,cA+.A\ 

(2.2) can be strengthened to ||^ — A:(7^;^74||^ < 



AY2Y:^r..Y 



2,CM,C: 



+ 1 1 AY2 II , by modifying 



the first step of the proof of Lemma 2.2 from ||^|| < HAYiH + ||^l2|| to 



\Af < 



l^^ill + 



11^45^211 • A similar result for the case where X,Y are singular vectors can be found in [24]. 
The value of our results is that it works for a more general model (1.1). 



2.3. Proof of Theorem 1.2. Let Ax = (m/^)^/^ and Ay = [n/tf/'^. To prove the first 
part of Theorem 1.2, that is (1.4), apply the first principle or Theorem 2.1. From (2.1), it 



1,/?: 



0(Ax) hold 



is clear that ||yi,C:|| = O(Ay^), Y^^^. = 0(Ay), \\Xi^R:\\ = 0(A3^^), X 
simultaneously with probability at least 1 — 4/cm~^. 

For the second part of Theorem 1.2, that is (1.5), we need to refine (1.1) as follows. Let 

; Y\n/k'\-l 



X = (Xi, . . . , ) and y = (Yi, . . . , Y^n/k^ ) where Xi, . . . , X^^i^-^-i and Yi, 



has k columns, and Xp^/^i , l^pn/fc] have < k columns. Note that Xi 
where Xi, Yi,^ii are defined in (1.1). Rewrite (1.1) as 



11 



A 



11 



A = (Xi, . . . ,Xpm/fc-|) 



11 



Al,\n/k'] 



A 



\m/k'\ , [n/fc] 



/ ^1 \ 



(2.3) 



By applying Theorem 2.1 to every Xj, Yj and doing a union bound, we see that with probability 



at least 1 
X 



4m ^, we will have ||i^-,C:|| = 0(Ay"'^), 



Y, 



j,C: 



0(Ai 



IX; 



i,R: 



0(A 



X 



0{Xx) for all i,j. 
The rest of the proof is deterministic. 

Suppose the SVD of Arc is U diag{a)V'^. Define another vector a' such that a'^ = —ai 
if Ui < 6, zero otherwise. Let F = U diag{a')V'^ . The skeleton decomposition we return is 
A-c{Arc + F)~^Afi-. The point of adding F to Arc is to avoid "inverting" singular values 
that are less than 6. 

Define the i x i matrix E such that Erc = F and all other entries are zeros. Now, let 
B = A + E so that Brc = Arc + F. The skeleton returned is Ac:B~I^^Ar-. By construction. 



\A- B\\ < S and LB 



RC\ 
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Our objective is to bound \\A — A-cB'I^^Aji-W. Split this up by a perturbation argument. 
II A - A.,cB+cAr.,\\ < \\A -B\\ + \\B- B.,cB+cBr.,\\ + 

\\^-cBrc^R- ~ AcBrjjBr^W + \\A;cB^fjBji; - A:cB^fjAji:\\ 
<5+\\B-B.,cB+cBr:\\ + 

\\{B - A)Sc\\ McBr:\\ + \\AcB+c\\ IKiB - A)\\ 
<5+\\B-B,cB+cBr:\\ + 

^ \\KcBr:\\ + {\\B:cB^c\\ + WAcB^c - B:cB^c\\)S 
<6+\\B - B.,cB+cBr:\\ + 

S \\B+cBr:\\ + S \\B:cB+c\\ + B)Sc\\ S-'S 

<26+\\B-B.,cB^cBR:\\+6\\B+cBR:\\+d\\B.,cB+c\\ (2.4) 

It remains to bound ||-B — B-cB^qBr-\\ , \\B^f^Bji:\\ , ||i3:ci3^^|| using the second princi- 
ple. Intuitively, 1 1 -B]^(-,i?R: 1 1 cannot be too much bigger than ||i?^^i?/jc'|| < 1. 
By Lemma 2.2, 



< 



l,C: 



\Btf;BRc\\ + 



Yi,c: \\KcBR-y2Y,',c]\ + \mcBR-Y2\\ 

B^cW (UBR: - Ar..)Y2Y,^C:\\ + \\Ar.Y2Y,^C:\\) + 

B+c\\i\\iBR,-AR.,)Y2\\ + \\AR.Y2\\) 



< 

<l + 2 



+ 



Y 



1,C: 



Y+ 



+ 



rl(<5 + \\Ar.Y2Y^C:\\) + S^H^ + \\Ar:Y2\ 



Y 



1,C: 



\\Ar.Y2YIc4 + \\Ar:Y2\ 



By the first principle, the following holds whp: 

\\B+cBr,\\ = 0{Xy + Ay^l ||A^,y2i^2'!c:|| + UrM)- 
The same argument works for lli?:^^?^^!! . Whp, 

\\B..cB+c\\ = 0{Xx + XxS-^ \\X2,R:X^A.,c\\ + ||X2^Ac;||) 

Bounding ||-B — B-qB^^Br- || requires more work, but the basic ideas are the same. Recall 
that the second principle suggests that ||-B — B-cB^fjBR-\\ cannot be too much bigger than 
Ij-Bijc — BrcB'^^BrcW = 0. Applying Lemma 2.3 with A being B — B-cB'^^Br- yields 



(2.5) 
(2.6) 



\B - B:cBJ^(JBR^\ < 



^i,R- 




y+ 


^i,R- 




y+ 


^Ir-- 




y+ 


^tR- 




y+ 





11^2, ^^^(-B - B;cB^^Br;)YiY^q.\ 
\\Xi^R:X'[ (B - B;cB']^^Br;)Y2Y^.(j\ 

\\X2,R:X2 {B - B;cB'^^Br;)Y2Y^q.\ 

-T/D D _ D+ D_ II , 



X^{B - B.,cB+Br.,)Y2 
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which is bounded by 



Y+^.j \\Y,,C:\\ {\\X2,R:X^B\\ + \\X2,R:X^ B.,c\\ \\B+cBr.,\\) + 
Y+C-l \\Xl,R:\\ i\\BY2Y2^C:\\ + \\ B RrX^Y^^C --W \\B:CB^c\\) 

{\\X2,R:XjBY2Y^C:\\ + \\X2,R:X^ B.,c\\ \\ B R.Y2Y^c -.W) + 



X+j,.\\ \\X,,R..\\ {\\BY2\\ + \\B..cB+c\\ \\Br:Y2\\)+ 
Y+c.\\ \\Yi,C:\\ i\\xlB\\ + \\B^cBr:\\ \\X^B.,c\\) + 
X^BY2\\ + \\xJb.,c\\ \\Br.Y2\\ . 



We have paired H-'^i,/?:!! with 



X 



1,R: 



, and ||li,C:|| with 



because the first principle 



implies that their products are 0(1) whp. That means 



+ 



— 1 



\BR:Y2Y2^C:\\) + 



\\B - B.,cB+^.Br,\\ = 0{Xxi\\X2,R:X^B\\ + \\X2,R:X^B.,c\\ \\B+cBr:\\ 

Xy{\\BY2Y^C:\\ + \\BR:Y2Y^C:\\ \\B:CB+c\\) + 
Xx XYi\\X2,R:Xj BY2Y^(j.\\ + ||X2,_R:Xji3:c|| 5 

\\BY2\\ + \\B.,cB+c\\ \\Br:Y2\\ + 

\\X2B\\ + ||i?_^^i3iJ:|| ||Xj'i?:c|| + 

\\X^B.,c\\6-^\\Br.Y2\\)- 

We have dropped ||X2"i?l2|| because it is dominated by HxJ^H. (2.6) and (2.5) can be 
used to control ||i?:ci?J(^|| and Before doing that, we want to replace B with 

A in all the other terms. This will introduce some extra (5's. For example, ||X2,_R:X2"-B|| < 
||-^2,i?:-^2'-B - X2,R:Xj^|| + ||X2,i?:X|'^|| < 5+ ||X2,ij;XjA|| . Doiug the Same for othcr terms, 
we have that — B-qB^qBr-^^^ is whp 

0{Xx{6 + \\X2,R:X^A\\ + {6+ \\X2,R:XjA.,c\\) \\B+cBr:\\) + 
Xy{S+\\AY2Y^C:\\ + i^+\\^R:Y2Y2^C:\\) \\B:CB+c\\) + 
XxXy{6 + \\X2,R:X^AY2Yi^C:\\ + ||^2,i?:^2^Ac|| + 
,XJA.,c\\6-^\\Ar.Y2Y2^C:\\) + 



+ \\X: 



2,R: 



\\ArY2YIc:\. 
6+\\AY2\\ + {6+\\Ar.Y2\\)\\B:CB+c\ 



\X^A\\ + {6+\\X^A.,c\\)\\B+cBr:\\ + 



\X^A.,c\\ + \\Ar.Y2\ 



+ \\X^A.,c\\6- 



\\Ar.Y2\ 



Several terms can be dropped in the O notation, for example 6 < XxS < XxXyS and 



XiA.,c 



< 



We shall also plug in the estimates on and 
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from (2.6) and (2.5). This leads to 

0{Xx{\\X2,R:X^A\\ + {6+\\X2,R:X^A.,c\\){\y + XyS-^ \\Ar.Y2Y^C:\\ + ^'^ Pi?:^2||)) + 

XY{\\AY2Yi^C:\\ + {6 + \\Ar.Y2Y^cM^x + XxS-^ \\X2,r:X^ A.,c\\ + S'' \\X^A.,c\\))+ 

XxXy{6 + \\X2,R:X^AY2Y^C:\\ + \\X2,R:X^A.,c\\ + 
\\Ar:Y2Y^C:\\ + \\X2,R:X^A.,c\\ \\ A R,Y2Y^c -.11) + 

\\AY2\\ + {6 + \\Ar.Y2\\){Xx + Xx6-^ \\X2,r:X^ A.,c\\ + <5"^ \\xJA.,c\\)+ 
\\X^A\\ + {6 + \\X^A.,c\\){Xy + XyS-^ \\AR.Y2Yi^c:\\ + Pfi:>2||)+ 
\\X^A,c\\d-'\\AR,Y2\\)- 
Collect the terms by their Ax, Ay factors and drop the smaller terms to obtain that whp, is 

\\B - B.,cB+cBr:\\ = 0{Xx{\\X2,R:X^A\\ + \\Ar.Y2\\ + \\X2,R:X^A.,c\\ \\Ar.Y2\\) + 

Ay(||^y2^2^C7:|| + ||^2''^:C|| + \\AR.Y2Yi^C:\\ 1 1 ^2''^:C 1 1 ) + 

XxXY{d +\\X2,R:XjA.,c\\ + \\Ar.Y2Y^C:\\ + 

\\X2,R:X^A.,c\\ \\Ar.Y2Y^C:\\ + \\X2^R:X^ AY2Yi^C:\\) + 

\\X^A\\ + \\AY2\\+5~' \\X^A,c\\ \\Ar.Y2\\)- (2-7) 

We now have control over all three terms \\B — B-cB^^Br^W , || , lli?:^^]^^^!! . Sub- 

stitute (2.5), (2.6), (2.7) into (2.4). As the RHS of (2.7) dominates 6 multiplied by the RHS 
of (2.5), (2.6), we conclude that whp, ||^ — A:C7i?^^^^; || is also bounded by the RHS of (2.7). 
To obtain the basic bound, (1.4), we note that all the "normed terms" on the RHS of (2.7), 

e.g., AR-Y2Y^fj and are bounded by £k- It follows that whp, \\A — A-cB^qAr:\\ = 

0{XxXYi5 + e + 5~h^)). 

To obtain the other bound, (1.5), we need to bound each "normed term" of (2.7) differently. 
Recall (2.3). Consider ||X2,_R:XJ74;c||. See that 



-'^2,i?:-'^2'^:C = (-^2,R: , • • ■ , X\m/k'],R:) 
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A. 



[m/k],l 



A 



2,ln/k'] 



lm/k'],ln/k] 



\ ^\nlk\C: j 



Recall that at the beginning of the subsection, we show that whp 



x. 



i,R: 



0(A^^) and 



O(Ay^) for all Recall the definition of e^, in (1.3). It follows that whp, 

[m/k] [n/fc] 



\X2,R:X^A.,c\\ < Y.\\ 
i=2 j=l 



Ai 



Y, 



j,C: 



X2,R:XjAY2Y^C: 



Apply the same argument to other terms on the RHS of (2.7), e.g., 

0{X^^ Xy^e'i^) and ||X2'A:c|| = O(Ay^e^) whp. Mnemonically, a ii in the subscript leads to a 
Av^ and a C in the subscript leads to a XZ^ . 
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Algorithm 2. 0{mnk) algorithm 
Input : Matrix ^ e C™^". 

Output: A skeleton representation with exactly k rows and columns. 
Steps: 

1. Uniformly and independently select i columns to form A-c- 

2. Uniformly and independently select £ rows to form Aft:- 

3. Run RRQR on A-c to select k columns and form A-^c'- This takes 0{mk'^) 
time and 0{mk) space. 

4. Run RRQR on A^. to select k rows and form Ari-. This takes 0{nk'^) time 
and 0{nk) space. 

5. Compute Z = A'^^i{AA^,.). This takes 0{TAk+mk'^) time and 0{mk) space, 
where Ta is the time needed to apply ^ to a vector. 

6. Output A-c Z Ajii.. 



Recall that ||j4:ci?^^Aj:j; || is bounded by the RHS of (2.7). Upon simplifying, we obtain 
that II A — II = 0(AxAy(5+e'^ + AxAy(5~^e^^), that is (1.5). The proof is complete. 

3. Two other algorithms. 

3.1. Second algorithm. Algorithm 2 returns a skeleton with exactly k rows and columns. 
First, randomly select 0{k) rows and columns, then trim down to k columns and k rows 
by performing RRQR on A-c ^^nd ^/j. • For dense matrices, the most expensive step is the 
multiplication of A by A'^,.. However, for structured matrices, the most expensive step of 
Algorithm 2 may be the RRQR factorization of A-c and A^. and the inversion of A-c',Apii-, 
which take 0{mk'^) time. The overall running time is 0{TAk) + 0{mk'^) where Ta is the cost 
of matrix- vector multiplication, but for convenience, we refer to this as the 0{mnk) algorithm. 

It can be shown that once A-c, Ari- are fixed, the choice of Z = A'}^,AA'^,. is optimal in 
the Frobenius norm (but not in the 2-norm), that is Z = aigy^^^exe \\A — A-c''^^R':\\f- 
surprisingly, we have a better error estimate. 

Theorem 3.1. Let A be given by (1-1) for some k > 0. Assume m > n and Xi,Yi are 
ji-coherent. Recall the definitions of ek-,e'^ in (1.2) and (1-3). Let I > lOfiklogm. With 
probability at least 1 — 4km~'^ , Algorithm 2 returns a skeleton that satisfies 

\\A - A-cZAji'-.W = 0{ek{mk)^/^). 

Proof Let P = A-cA^-^, G C™""™. RRQR [23] selects k columns from A-,c such that 
\\A-.c - PA-,c\\ < f{k,£)ak+i{A-,c) < f{k,£)ak+i{A) < f{k,£)ek, 

where /(A;,£) := -v/l + 2k{£ — k). We have used the fact that (Tfc+i(^) = (^k+i ( "t^^ ) < 

V ^21 ^22 / 

ci(^^2) — ^f'- interlacing theorems [27, Corollary 3.1.3]. 
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Algorithm 3. 0{nk'^) algorithm 
Input : Matrix yl e C™^". 

Output: A skeleton representation with i = 0{k) rows and k columns. 
Steps: 

1. Uniformly and independently select ^ rows to form A^-. 

2. Run RRQR on Ar-. Obtain Ar- ~ Ajic'{A'^^,Ar-) where Arc" contains k 
columns of Aji-. This takes 0{nk'^) time and 0{nk) space. 

3. Compute Z = A'^^,. This takes 0{k'^) time and 0{k'^) space. 

4. Output A.c'ZAr;. 



Recall from (2.1) that y+^. = 0{{n/lY/'^) with probability at least 1 - 2km'^. Apply 
Corollary 2.4 to obtain that whp 

\\A - PA\\ < 0(Ay) \\A.,c - PA.,c\\ + 0(Ay)efe + Sk 
= 0{ek{n/e)'/^f{k,£)) = 0{ek{nkf/^). 

Let P' = A'^i.Ari.. By the same argument, \\A — AP'\\ = 0{ek{mk)^/'^) with the same failure 
probability. Combine both estimates. With probability at least 1 — 4km~'^, 

\\A- A.,c'Af^,AA+,.AR,.]\ = \\A- PAP'W 

< \\A- PA\\ + \\PA- PAP'W 

< \\A- PA\\ + \\A-AP'\\ 
= 0{ek{mkf'^). 

■ 

Many algorithms that use the skeleton A-c{A'^(jAA'^.)Aji-, e.g. [29], seek to select columns 
such that II A — 74:C'A!^A|| is small. Here, we further select k out of £ = 0{k) columns, which is 
also suggested in [5]. Their 2-norm error estimate is 0{k log^''^ k)ek-\-0{k^/^ log^''^ /c)!!^!— 
where A), is the optimal rank k approximation to A. In general \\A — j4fc||^ < {n — /c)^/^^^, so 
our bound is a factor k^^^ better. Our proof is also more straightforward. Nevertheless, we 
make the extra assumption that Xi,Yi are incoherent. 

3.2. Third algorithm. Consider the case where only Xi is 0(l)-coherent. See Algorithm 
3. It computes a skeleton with 0{k) rows and k columns in 0{nk'^ + k^) time. Intuitively, 
the algorithm works as follows. We want to select k columns of A but running RRQR on A is 
too expensive. Instead, we randomly choose 0{k) rows to form Ar-, and select our k columns 
using the much smaller matrix Ar:. This works because Xi is assumed to be 0(l)-coherent 
and choosing almost any 0{k) rows will give us a good sketch of A. 

Theorem 3.2. Let A be given by (1-1) for some A; > 0. Assume m>n and Yi is fi-coherent. 
(There is no assumption on Xi.) Recall the definition of Sk in (1-2). Let i > lO/i/clogm. 
Then, with probability at least 1 — 2km~'^, Algorithm 3 returns a skeleton that satisfies 

\\A- A.,c'ZAr:\\ = Oiskimn)^/^). 



14 



Proof. We perform RRQR on Aji-, to obtain Aji- ~ Arc'D where D = A~^ij,Aji- and 
C indexes the selected k columns. We want to use the second principle to "undo the row 
restriction" and infer that A ~ Aq'D, the output of Algorithm 3. We now fill in the details. 

Strong RRQR [23] guarantees that 



and 



\Ar: - Anc'DW < ak+i{AR.,)f{k,n) < ak+i{A)f{k,n) < ekf{k,n) 



\D\\ < f{k,n) 



where /(/c, n) = \/l + 2k{n — k). Prepare to apply a transposed version of Corollary 2.4, that 
is 



\A- APW < 



X 



1,R: 



Ar: - Ar.P\\ + 



\I - P\\ \\X2,R:X^A\\ + \\I - P\\ \\X^A\ 



(3.1) 

Let P = Sc'D, so that ||P|| < ||i:>|| < f{k,n). Note that AP = A.c'A%^^,Ar.. By (2.1), with 



probability at least 1 — 2km ^, 



X 



1,R: 



0((m/^)i/2). By (3.1), 



\A - AP\\ < 0(Ax) \\Ar., - Arc'DW + 0(Ax)(l + \\P\\)ek + (1 + \\P\\)ek 
= 0{ekf{k,n){m/ifl^) = 0{ek{mnf/^). 



If Xi is not incoherent and we fix it by multiplying on the left by a randomized Fourier 
matrix TD (cf. Section 1.6), then we arrive at the algorithm in [28]. The linear algebraic 
part of their proof combined with the first principle will lead to the same bounds. What we 
have done here is to decouple the proof into three simple parts: (1) show that Xi := FDXi 
is incoherent, (2) use the first principle to show that Xi^r- is "sufficiently nonsingular" , (3) 
use the second principle to finish up the linear algebra. 



3.3. Comparing the three algorithms. 

in this paper. Assume m > n. 



Here is a summary of the three algorithms studied 





No. rows 


No. columns 


Error in 2-norm 


Running time 


Memory 


Algorithm 1 
Algorithm 2 
Algorithm 3 


i = 0{k) 
k 

£ = 0{k) 


l = 0{k) 
k 
k 


0(efc(mn)V2/^) 
0(efc(mfc)V2) 
0(efc(mn)^/2) 


0(P) 
0{TAk) + 0(mP) 
0(nfc2) 


0(ifc2) 

0{mk) 
d{nk) 



Suppose A has structure and Ta <^ mn. If we can tolerate a m factor in the running time 
and memory usage, then Algorithm 2 is the method of choice. Otherwise, we recommend 
using Algorithm 1. It is much faster and even though Theorem 1.2 suggests that the error 
grows with (mn)^/^, we believe that in practice, the error usually grows with m^/^. 

Suppose A is dense and has no structure. In this case, Algorithm 2 is too slow. As for 
Algorithm 3, it runs significantly slower than Algorithm 1 and its error bounds are not any 
better. However, if we do not want to worry about setting the parameter 6, we will prefer to 
use Algorithm 3. 
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Figure 4.1. Loglog plot of the empirical mean of the error (in 2-norm) by the 0{k^) algorithm versus 5, a 
regularization parameter. This relationship between the error and 5 agrees with Theorem 1.2. See (4-.1). More 
importantly, the error blows up for small 5, which implies that the regularization step is essential. 

4. Examples. For this section, we consider only Algorithm 1, the 0{k^) algorithm. 

4.1. Toy example. Let A = X'EY'^ € C"^" where X, Y are unitary Fourier matrices and 
S is a diagonal matrix of singular values. Note that every entry of X, Y is of magnitude 
and are 1-coherent. 

For the first experiment, A is 301 x 301 and e = = (yk+i = • • • = cr„ = 10"^^. We 
also pick the first k singular values to be logarithmically spaced between 1 and Ek- In each 
trial, we randomly pick ^ = 100 rows and columns and measure ||^ — A-cZAji-\\. The only 
parameters being varied are k and 6. 

From (1.4) in Theorem 1.2, we may expect that when variables such as n, m, i, k are fixed, 

log \\A - A.,cZAr., II cc log{6~\ek + 5)^) = - log 5 + 2 log{ek + 6). (4.1) 

Consider a plot of || A — A-c^Aji: \\ versus (5 on a log-log scale. According to the above equation, 
when (5 <C e/c, the first term dominates and we expect to see a line of slope —1, and when 
5 ^ Ek, log(efc + (^) — log 5 and we expect to see a line of slope +1. Indeed, when we plot the 
experimental results in Figure 4.1, we see a right-angled 1^-curve. 

A curious feature of Figure 4.1 is that the error curves behaves like a staircase. As we 
decrease k, the number of different error levels decrease proportionally. A possible explanation 
for this behavior is that the top singular vectors of A-c niatch those of A, and as 6 increases 
from cri{A) to ai.-i{A) for some small i, smaller components will not be inverted and the error 
is all on the order of o'i(A). 
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e— £=10"^ 




Figure 4.2. Loglog plot of the empirical mean of the error (in 2-norm) by the 0{k^) algorithm versus n, 
the size of square matrix A. Here, k, I are fixed, A — XiY^ + eX2Y2 and £k = Sk+i = ...=£. The parameter 
5 is set at £{£/n^'''^). The error is roughly proportional to n^^^e. 

For the second experiment, we use the same matrix but vary n instead of 6. We fix A; = 9, 
£ = 40 and 6 = efc£/n^/^. There are three different runs with = e = 10~^, 10~^, 10"^". 
The results are plotted in Figure 4.2. They suggest that the error \\A — A-c^Ar-W scales with 
n^/^, which is better than (1.6). 

4.2. Smooth kernel. Consider a ID integral operator with a kernel K that is analytic 
on [—1,1]^. Define A as {A)ij = cK{xi,yj) where the nodes xi,...,Xn and yi,...,yn are 
uniformly spaced in [—1, 1]. First, suppose K = X]i<jj<6Cij 

10~^A^ where Ti{x) is the i-th Chebyshev polynomial and A'^ is the random Gaussian matrix, 
i.e., noise. The coefficients Cjj's are chosen such that ||^|| ~ 1. Pick n = m = 10^ and slowly 
increase i, the number of rows and columns sampled by the 0(A:^) algorithm. As shown in 
Figure 4.3, the skeleton representation A-c^An- converges rapidly to A as we increase I. 

Next, consider K{x,y) = cexp(xy). Let n = 900 and pick c to normalize ||^|| = 1. We 
then plot the empirical mean of the error of the 0{k^) algorithm against £ on the left of Figure 
4.4. Notice that the error decreases exponentially with £. 

To understand what is happening, imagine that the grid is infinitely fine. Let {pi,ip2, ■ ■ ■ 
be Legendre polynomials. Recall that these polynomials are orthogonal on [—1, 1]. Define the 
matrices X,Y as {X)ij = (pj{xi) and {Y)ij = (pj{yi). Assume the (pis are scaled such that 
X, Y are unitary. It is well-known that if we expand K in terms of Chebyshev polynomials or 
Legendre polynomials [6] or prolate spheroidal wave functions [41], the expansion coefficients 
will decay exponentially. This means that the entries of X^ AY should decay exponentially 
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log^Qabs(A) 



log^oabs(A^ZAp) (1=12) 
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Figure 4.3. A is the smooth kernel K{x,y) where K is the sum of 6^ low degree Chebyshev polynomials 
evaluated on a 10^ x 10^ uniform grid. The topleft figure is A while the other figures show that the more intricate 
features of A start to appear as we increase £ from 12 to 18 to 24. Recall that we sample £ rows and £ columns 
in the 0{k^) algorithm. 



away from the topleft corner and = @{£k) (cf- (1-2) and (1-3)). We confirm this by plotting 
Skj^'k versus k on the right of Figure 4.4. The actual X, Y used to obtain this plot are obtained 
by evaluating the Legendre polynomials on the uniform grid and orthonormalizing. It can be 
verified that the entries of X, Y are of magnitude 0{{k/nY^'^) which implies a coherence ~ /c, 
independent of n. The implication is that the algorithm will continue to perform well as n 
increases. 

As I increases, we can apply Theorem 1.2 with a larger k. Since efc,e^ decrease exponen- 
tially, the error should also decrease exponentially. However, as k increases beyond ~ 15, £k 



18 



^5=10" 
^8=10" 
^5=10 




-11 



10 



10" 



10-^° 



50 



100 



150 



10 



10 



-15 



-20 




10 



15 



Figure 4.4. A is the smooth kernel K(x,y) = exp(— a;t/) sampled on a uniform grid. The 



on the left 



shows that the error of the 0{k ) algorithm decreases exponentially with £, the number of sampled rows and 
columns. The figure on the right shows that if we expand A in terms of Legendre polynomials, the coefficients 
(and therefore ek,s'k) decay exponentially. See (1.1), (1-2) and (1.3) for the definitions of eu and e'j.. 



stagnates and nothing can be gained from increasing I. In general, as decreases, we should 
pick a smaller 5. But when A; > 15, choosing a smaller 5 does not help and may lead to worse 
results due to the instability of pseudoinverses and floating point errors. This is evident from 
Figure 4.4. 

Platte, Trefethen, Kuijlaars [32] state that if we sample on a uniform grid, the error of 
any stable approximation scheme cannot decrease exponentially forever. In this example, the 
random selection of columns and rows correspond to selecting interpolation points randomly 
from a uniform grid, and 5 serves as a regularization parameter of our approximation scheme. 
The method is stable, but we can only expect an exponential decrease of the error up to a 
limit dependent on 5. 

4.3. Fourier integral operators. In [11], Candes et al. consider how to efficiently apply 
2D Fourier integral operators of the form 

where /(^) is the Fourier transform of /, a(x,i^) is a smooth amplitude function and <^ is a 
smooth phase function that is homogeneous, i.e., $(rE, A.^) = A$(a;,^) for any A > 0. Say there 
are A^^ points in the space domain and also the frequency domain. 

The main idea is to split the frequency domain into \/iV wedges, Taylor expand $(a;, •) 
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about 1^1 where j indexes a wedge, and observe that the residual phase ^j{x, ^) := <I>(x, ^) — 
^(^> I'^l ij) ■ C is nonoscillatory. Hence, the matrix A^J^ := ex.p{2Tri^j{xs,Ct)) can be approxi- 
mated by a low rank matrix, i.e., exp(27ri$j(x, ^)) can be written as "^^^i fq{x)gg{S,) where 
r, the separation rank, is independent of N. By switching order of summations, the authors 
arrive at 0{N^'^) algorithms for both the preprocessing and the evaluation steps. See [11] for 
further details. 

What we are concerned here is the approximate factorization of A^^\ This is a N"^ by 
A^-^'^ matrix since there are N"^ points in the space domain and N'^/^/N points in one wedge 
in the frequency domain. In [11], a slightly different algorithm is proposed. 

1. Uniformly and randomly select i rows and columns to form and A-c- 

2. Perform SVD on A-c- Say A-^c = UiAiV^ + 112^2^2 where U,V are unitary and 
11^2 II < (5, a user specified parameter. 

3. Return the low rank representation UiU^^.Aji-. 

In the words of the authors, "this randomized approach works well in practice although we 
are not able to offer a rigorous proof of its accuracy, and expect one to be non-trivial" [11]. 
We believe this is because A does satisfy the assumptions of this paper. See (1.1), (1-2). 

To see why, let i? be a perturbation of A such that B-c = UiKiV^ and \\A — B\\ < 6. 
Since Ai is invertible, the output can be rewritten as 

UiU+^Ar; = B.,cB+cAr.,. 

It is easy to adapt the proof of Theorem 1.2 to bound ||^ — B-cBj^^Ar- || — just replace 
(2.4) with the following. The rest of the proof is the same. 

11^ - B.,cB+cAn:\\ < \\A - B\\ + \\B - B.,cB^cBr:\\ + \\B..cB^cBr: - B.,cB^cAr.,\\ 
<6+\\B- B.,cB+cBr:\\ + \\B:cB^c\\ 

An important subclass of Fourier integral operators is pseudodifferential operators. These 
are linear operators with pseudodifferential symbols that obey certain smoothness conditions 
[36]. In Discrete Symbol Calculus [15], a similar randomized algorithm is used to derive low 
rank factorizations of such smooth symbols. It is likely that the method works well here in 
the same way as it works well for a smooth kernel as discussed in the previous section. 
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